03 - Logistic Regression

University of San Francisco, MSMI-608

Pre-Class Code Assignment Instructions

This semester, I am going to ask you to do a fair bit of work before coming to class. This will make our class time shorter, more manageable, and hopefully less boring.

I am also going to use this as an opportunity for you to directly earn grade points for your effort/labor, rather than “getting things right” on an exam.

Therefore, I will ask you to work through the posted slides on Canvas before class. Throughout the slides, I will post Pre-Class Questions for you to work through in R. These will look like this:

Pre-Class Q1

In R, please write code that will read in the .csv from Canvas called sf_listings_2312.csv. Assign this the name bnb.

You will then write your answer in a .r script:

Click to show code and output
# Q1
bnb <- read.csv("sf_listings_2312.csv")

Important:

To earn full points, you need to organize your code correctly. Specifically, you need to:

  • Answer questions in order.
    • If you answer them out of order, just re-arrange the code afterward.
  • Preface each answer with a comment (# Q1/# Q2/# Q3) that indicates exactly which question you are answering.
    • Please just write the letter Q and the number in this comment.
  • Make sure your code runs on its own, on anyone’s computer.
    • To do this, I would always include rm(list = ls()) at the top of every .r script. This clears everything from your environment, so you can check whether your script runs from scratch on anyone’s computer (including mine). A sketch of how a submission might be organized follows this list.
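
To put these rules together, here is a rough sketch of how a submission for this module might be organized (the Q1 line is the example answer from above; the rest are placeholders):

# firstname_lastname_03.txt
rm(list = ls()) # Clear the environment so the script runs on its own

# Q1
bnb <- read.csv("sf_listings_2312.csv")

# Q2
# ...your answer to Q2 here...

# Q3
# ...your answer to Q3 here...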

Handing this in:

  • You must submit this to Canvas before 9:00am on the day of class. Even if class starts at 10:00am that day, these are always due at 9:00.
  • You must submit this code as a .txt file. This is because Canvas cannot present .R files to me in SpeedGrader. To save as .txt:
    • Click File -> New File -> Text File
    • Copy and paste your completed code to that new text file.
    • Save the file as firstname_lastname_module.txt
      • For example, my file for Module 01 would be matt_meister_01.txt
      • My file for module 05 would be matt_meister_05.txt

Grading:

  • I will grade these for completion.
  • You will receive 1 point for every question you give an honest attempt to answer.
  • Your grade will be the number of questions you answer, divided by the total number of questions.
    • This is why it is important that you number each answer with # Q1.
    • Any questions that are not numbered this way will be graded incomplete, because I can’t find them.
  • You will receive a 25% penalty for submitting these late.
  • I will post my solutions after class.

Setup

Load in these packages. If you do not have them, you will need to install them.

  • e.g., install.packages("dplyr")
library(stringr)
library(dplyr)
library(ggplot2)
library(tidyr)

Read in airline_satisfaction.csv from Canvas, and assign it the name data:

Click to show code and output
data <- read.csv('airline_satisfaction.csv')

And read in airbnb_logistic.csv from Canvas, and assign it the name airbnb.data:

Click to show code and output
airbnb.data <- read.csv('airbnb_logistic.csv')

Binary Outcomes

Thus far, we’ve spent a lot of time talking about “linear” regression. Linear regression is called linear because it finds the equation of the line that best fits our data. Effectively–what is the best line we can draw through our data points?

Now, we will introduce logistic regression. Logistic regression is similar to linear regression, in that we are drawing a line through points in our data, which we do to predict some outcome. However, logistic regression is used in situations where our outcome is either a 0 or a 1 (or a yes/no, or a TRUE/FALSE). With a logistic regression, we are predicting the probability of an outcome. (In linear regression, we are predicting the level of an outcome.)

For example, consider last week’s Airbnb data. We might be interested in staying at an Airbnb in North Beach, but terrified that the density of apartments and restaurants–which attracted us in the first place–also means that Airbnbs close to the center of North Beach have a rat problem. Being well-informed statisticians (and newly skilled at text analysis), we could analyze whether reviews are more likely to mention rats if they are closer to the center of North Beach.1

Pre-Class Q1

In the data set airbnb_logistic.csv, each observation is a review: you have the Airbnb’s distance from the center of North Beach (in miles), and whether the review mentions rats. Remember that we can only use a logistic regression on outcomes that are binary. So to start, create a histogram for each variable (distance and rats), to check that rats is binary, and that distance is not crazy.

Click to show code and output
hist(airbnb.data$rats)

hist(airbnb.data$distance)

Pre-Class Q2

Now, using a linear regression, test whether distance is related to rats. Then, plot this with ggplot(), using geom_smooth(method = "lm").

Click to show code and output
summary(lm(data = airbnb.data, rats ~ distance))

Call:
lm(formula = rats ~ distance, data = airbnb.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.49174 -0.21301 -0.02665  0.15364  0.73748 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.56044    0.05812   9.643 7.21e-16 ***
distance    -0.10907    0.01374  -7.940 3.40e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2889 on 98 degrees of freedom
Multiple R-squared:  0.3915,    Adjusted R-squared:  0.3853 
F-statistic: 63.05 on 1 and 98 DF,  p-value: 3.399e-12
ggplot(data = airbnb.data, 
       aes( x = distance, 
            y = rats)) +
  geom_smooth(method = 'lm')

You should see that mention of rats decreases with distance, and that this is statistically significant:

Because our DV is either 0 or 1, the coefficient can be interpreted as a change in probability of the outcome with a one-unit change in the IV. And each point on the line shows us the predicted probability of rats at a given distance. In cases like this (i.e., with binary dependent variables), linear regressions are also referred to as “linear probability models”. This is good enough to tell us that rats are more prevalent as we get closer to North Beach. However, there are two oddities you should notice with this relationship:

First, linearity:

A linear regression always predicts straight lines. As a result, it predicts that the relationship between rats and distance is perfectly linear. This regression is saying that the likelihood of encountering rats decreases by 10.9 percentage points2 with each additional mile from North Beach.
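
If you want to see this arithmetic for yourself, you can pull the coefficients out of the fitted model and compute a prediction by hand. A quick sketch (the name lpm is just my shorthand for “linear probability model”):

# Refit the regression from Q2 and pull out its coefficients
lpm <- lm(data = airbnb.data, rats ~ distance)
coef(lpm)
# (Intercept) ~ 0.560, distance ~ -0.109 (matching the summary above)

# Predicted probability of a rat mention 2 miles from the center of North Beach:
coef(lpm)[["(Intercept)"]] + coef(lpm)[["distance"]] * 2
# 0.560 - 0.109 * 2 = about 0.34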

Sometimes linearity is correct. Sometimes, people’s behavior does change according to some constant relationship with a predictor. But, often, relationships are not linear.

One common example of this in marketing is the relationship between advertising spending and sales. When we increase ad spending, sales also increase. However, this is not linear. When we go from spending $0 to $100, we should see a big bump in sales. But if we go from spending $100,000 to $100,100, it is very unlikely we see any difference at all. This specific pattern is referred to as “diminishing marginal returns”, and it applies to a lot of contexts.
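
As a toy illustration (all numbers here are made up, purely to show the shape of the curve), sales that rise with the log of ad spending follow exactly this pattern: big gains early, almost none later:

# Hypothetical diminishing-returns curve (made-up numbers, for illustration only)
ad_spend <- seq(0, 100000, by = 100)
sales <- 500 * log(1 + ad_spend)
plot(ad_spend, sales, type = "l",
     xlab = "Ad spending ($)", ylab = "Sales")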

So in our example, we might be concerned that the relationship between distance and rats is also not linear. We’ll talk about when this is a problem (vs when linearity is “fine enough”) later.

Second, impossible predictions:

Look back up at the plot you made with your linear prediction. Remember, rats is binary. It can be 0 or 1. Therefore, our predictions are probabilities. They should be between 0 and 1–0 corresponding to “no chance”, and 1 corresponding to “certainty”. Let’s test if we satisfy that here.

Pre-Class Q3

Use the predict() function to predict the probability of rats at distances of 0, 1, and 6 miles, using your linear regression. Remember that predict() takes three arguments: (1) An object (which is our model, in this case a linear regression), (2) newdata (which is a data.frame with the exact same variable names as in our regression), and (3) type (which we will always set to "response" in this class).

Click to show code and output
linear_model <- lm(data = airbnb.data, rats ~ distance)
new_data <- data.frame(distance = c(0, 1, 6))
predict(object = linear_model,
        newdata = new_data,
        type = "response")
          1           2           3 
 0.56044316  0.45137074 -0.09399136 

You might have expected this after looking at the plot above. When distance is relatively small, your predictions are fine–they are between 0 and 1. But, when distance is 6, your prediction is negative! This happens because a linear regression has to fit a straight line to our data. So, while it fits the best line it can, that linearity can cause it to make impossible predictions–probabilities above 1 and below 0.

So maybe we need something better. In class and in your homework, we will see that, often, linear regression is fine in practice. But before we do that, let’s consider the alternative–logistic regression.

Logistic Regression

Logistic regression models the probability of an outcome using an S-shaped curve called the logistic function. This S-shape ensures that probabilities are between 0 and 1. To run a logistic regression in R, we use a function very similar to lm(), with two changes. For example, to test our question above using linear regression, we wrote:

lm(data = airbnb.data, rats ~ distance)

But to test our question using logistic regression, we write:

glm(data = airbnb.data, rats ~ distance, family = "binomial")

The two changes are:

  • Using glm(), which stands for Generalized Linear Model, and allows for different forms of regression.
  • Specifying family = "binomial", which tells R that the dependent variable is binary, so it should use the logistic function.
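
Under the hood, family = "binomial" tells glm() to run the prediction through the logistic function described above, \(p = \frac{1}{1 + e^{-(b_0 + b_1 \times \text{distance})}}\), which can never leave the 0-1 range. A quick sketch to convince yourself of those bounds, using base R’s plogis() (which computes exactly this function):

# plogis(x) = 1 / (1 + exp(-x)): always strictly between 0 and 1
plogis(c(-100, -5, 0, 5, 100))
curve(plogis(x), from = -6, to = 6,
      xlab = "b0 + b1*distance", ylab = "Probability")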

Additionally, when we plot this using ggplot(), we change the code in just two ways again. Instead of writing geom_smooth(method = 'lm'), we:

  • Change method = 'lm' to method = 'glm'
  • Inside of geom_smooth() (after method = 'glm'), we add the argument method.args = "binomial"
  • Note that sometimes you will also have to add the argument formula = y ~ x

Pre-Class Q4

Now, using a logistic regression, test whether distance is related to rats. Then, plot this with ggplot(), using geom_smooth(method = "glm", method.args = "binomial").

Click to show code and output
summary(glm(data = airbnb.data, rats ~ distance, family = "binomial"))

Call:
glm(formula = rats ~ distance, family = "binomial", data = airbnb.data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.8001     0.9715   2.882 0.003949 ** 
distance     -2.5101     0.7139  -3.516 0.000438 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 87.934  on 99  degrees of freedom
Residual deviance: 30.340  on 98  degrees of freedom
AIC: 34.34

Number of Fisher Scoring iterations: 8
ggplot(data = airbnb.data, 
       aes( x = distance, 
            y = rats)) +
  geom_smooth(method = "glm", 
              method.args = "binomial")

In this case, the S-shape is backwards, because rats decrease with distance. But you should notice that this model tells a very different story, and makes very different predictions from the linear model. This is super obvious if you look at the two plots side-by-side, including points:

Obviously, the linear regression was not flexible enough to pick up on the fact that this relationship is actually not linear. The difference in rats from 0-1 mile is much larger than the difference from 3-4 miles, or even 3-6! There are also no rats further than 3 miles away, and very few at 2 miles away. The logistic regression can fit itself to all of this information, whereas the linear regression cannot. Ideally, this makes better predictions.

Pre-Class Q5

Use the predict() function to predict the probability of rats at distances of 0, 1, and 6 miles, using your logistic regression. Remember that predict() takes three arguments: (1) An object (which is our model, in this case a logistic regression), (2) newdata (which is a data.frame with the exact same variable names as in our regression), and (3) type (which we will always set to "response" in this class).

Click to show code and output
logistic_model <- glm(data = airbnb.data, rats ~ distance, family = "binomial")
new_data <- data.frame(distance = c(0, 1, 6))
predict(object = logistic_model,
        newdata = new_data,
        type = "response") |>
  round(3)
    1     2     3 
0.943 0.572 0.000 

Conveniently, we don’t get any impossible predictions!

Coefficients in a logistic regression

So far, logistic regression has been all good news. Now is when the bad begins. Look at the coefficients we got for that logistic regression:

Logistic Regression Coefficients

              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)       2.80       0.972    2.882     0.004
distance         -2.51       0.714   -3.516     0.000

How is the intercept above 1 if the whole curve is between 0 and 1? And how is the coefficient for distance -2.51 if a one-unit change in distance clearly does not lead to a 2.51 unit decrease in rats?

Unfortunately, explaining this will take some math.

The key to understanding these results is to be aware that both the intercept and “slope” in a logistic regression are expressed in log-odds, not probabilities.

  • Odds are the ratio of the probability of success (\(p\)) to failure (\(1-p\)).
    • \(\text{odds} = \frac{\text{p}}{1-\text{p}}\)
    • The log-odds is the natural logarithm of these odds (\(\text{log-odds} = \log(\frac{\text{p}}{1-\text{p}})\)).
      • We’ve seen the natural logarithm before, when we used log() in R.
  • Meanwhile, probability is equal to \(\text{p} = \frac{\text{odds}}{1 + \text{odds}}\)
    • Starting with:
      • \(\text{odds} = \frac{p}{1-p}\)
    • Multiply both sides by \(1-p\) to eliminate the denominator:
      • \(\text{odds} \times (1-p) = p\)
    • Distribute \(\text{odds}\):
      • \(\text{odds} - \text{odds} \times p = p\)
    • Rearrange to isolate \(p\):
      • \(\text{odds} = p + \text{odds} \times p\)
    • Factor \(p\) out of the right-hand side:
      • \(\text{odds} = p \times (1 + \text{odds})\)
    • Divide through by \(1 + \text{odds}\):
      • \(p = \frac{\text{odds}}{1 + \text{odds}}\)
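
These conversions are easy to check in R. A quick sketch, using p = 0.5 as an example probability:

p <- 0.5                     # example probability
odds <- p / (1 - p)          # odds = 1
log_odds <- log(odds)        # log-odds = 0

# And back again:
odds_back <- exp(log_odds)
p_back <- odds_back / (1 + odds_back)
p_back                       # 0.5 again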

Why is the intercept above 1?

Conveniently, the intercept is still the value of something when the independent variable is 0. But this is not the predicted value of the dependent variable (as it is with linear regression). With logistic regression, everything is in log-odds, including the intercept.

The intercept is the log-odds of the outcome (\(\text{rats} = 1\)) when the independent variable (\(\text{distance}\)) is 0.

In our example

The intercept is 2.8. Remember also that probability equals \(p = \frac{\text{odds}}{1 + \text{odds}}\). And, that log-odds are \(\text{log}(\text{odds})\). That means that when distance is zero:

\(\text{log}(\text{odds}) = 2.8\)

You can convert log-odds to odds by exponentiation. In R, this is the function exp():

\(\text{odds} = e^{2.8}\) \(\text{odds} = 16.44\)

We can sub this into the probability equation:

\(p = \frac{\text{16.44}}{1+\text{16.44}}\)

And then just regular ole’ math:

\(p = \frac{\text{16.44}}{\text{17.44}}\) \(p \approx 0.943\)

When distance is zero, the predicted probability of a review mentioning a rat is \(\approx\) 94.3%.

Note:

This math is not a lot of fun. Here is a function that will take log-odds, and give you probability. Save this function somewhere and source it when you need to use log-odds.

prob_from_log_odds <- function(log_odds){
  odds <- exp(log_odds) # Exponentiate first
  prob <- odds / (1 + odds)
  return(prob) 
}

Here is how you can use it:

# Provide regression coefficient directly
prob_from_log_odds(logistic_model$coefficients[[1]])
[1] 0.942682
# Input a number
prob_from_log_odds(16.44)
[1] 0.9999999
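
As an aside, base R also has a built-in function, plogis(), that performs this same log-odds-to-probability conversion (and qlogis() goes the other way). If you would rather not keep track of a helper function, this should give the same answer:

plogis(logistic_model$coefficients[[1]]) # ~0.943, same as prob_from_log_odds() above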

How can the coefficient for distance be -2.51 if a one-unit change doesn’t lead to a 2.51 unit decrease in rats?

This happens because, again, the coefficient is expressed in log-odds, not the probability scale. A coefficient of -2.51 means that for a one-unit increase in distance, the log-odds of rats decreases by 2.51.

In terms of odds, the change (expressed by \(\Delta\)) is multiplicative rather than additive. Exponentiating the coefficient gives the factor by which the odds change (the odds ratio):

\(\Delta \text{log}(\text{odds}) = -2.51\) \(\text{odds ratio} = e^{-2.51} \approx 0.081\)

This means that for each one-unit increase in distance, the odds of rats are multiplied by about 0.081; that is, they drop to roughly 8.1% of their prior value. This multiplicative change on the odds scale is what allows the logistic regression to curve on the probability scale.
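
You can check this whole chain with the numbers we already have. A quick sketch (b0 and b1 are just my shorthand for the two coefficients; expect small rounding differences):

b0 <- 2.8001    # intercept: log-odds of rats when distance = 0
b1 <- -2.5101   # coefficient for distance

odds_at_0 <- exp(b0)         # ~16.4
odds_at_1 <- exp(b0 + b1)    # ~1.34
odds_at_1 / odds_at_0        # ~0.081, which is exp(b1): the odds ratio

# Converting back to probabilities recovers the predictions from Q5:
odds_at_0 / (1 + odds_at_0)  # ~0.943
odds_at_1 / (1 + odds_at_1)  # ~0.572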

My suggestion

Logistic regression coefficients are quite confusing if you’re not well-versed in this math. This lack of easy interpretability is their biggest flaw. Instead of trying to understand them in detail, I would use the predict() function in R to see how your predicted probabilities change at different levels of your independent variable.

What you need to take away from this is not the specifics of the math, but the ideas that:

  1. Logistic regression provides coefficients in log-odds, not probabilities.
  2. This keeps the logistic regression’s predicted probabilities between 0 and 1 (the curve approaches, but never passes, 0 and 1).
  3. This also allows the logistic regression function to curve, identifying non-linear relationships between our dependent variable and independent variable.

Airline Satisfaction

The data set airline_satisfaction.csv on Canvas contains the results of a survey of airline customers. Responses were collected immediately after customers got off their flight. The airline collected some personal and travel information from each customer, and then asked about their satisfaction with a number of elements of their trip.

In class, we will practice testing logistic regressions, comparing them to linear regressions, and showing how we can make predictions on new data with these functions. In the pre-class workshop, I would like you to orient yourself to the data.

Pre-Class Q6

Our dependent variable is satisfaction. Summarise this variable, and see what (if anything) we have to do in order to use it in a regression.

Click to show code and output
summary(data$satisfaction)
   Length     Class      Mode 
    25976 character character 
unique(data$satisfaction)
[1] "satisfied"               "neutral or dissatisfied"
#This variable is a character, so we are going to have to turn it into 0/1
data$satisfied <- ifelse(data$satisfaction == "satisfied", 1, 0)

Pre-Class Q7

What percentage of customers are satisfied?

Click to show code and output
mean(data$satisfied)
[1] 0.4389821

Pre-Class Q8

Now, let’s get some ideas for a potential hypothesis. You can look at the other variables we have with str() and summary().

Click to show code and output
str(data)
'data.frame':   25976 obs. of  26 variables:
 $ X                                : int  0 1 2 3 4 5 6 7 8 9 ...
 $ id                               : int  19556 90035 12360 77959 36875 39177 79433 97286 27508 62482 ...
 $ Gender                           : chr  "Female" "Female" "Male" "Male" ...
 $ Customer.Type                    : chr  "Loyal Customer" "Loyal Customer" "disloyal Customer" "Loyal Customer" ...
 $ Age                              : int  52 36 20 44 49 16 77 43 47 46 ...
 $ Type.of.Travel                   : chr  "Business travel" "Business travel" "Business travel" "Business travel" ...
 $ Class                            : chr  "Eco" "Business" "Eco" "Business" ...
 $ Flight.Distance                  : int  160 2863 192 3377 1182 311 3987 2556 556 1744 ...
 $ Inflight.wifi.service            : int  5 1 2 0 2 3 5 2 5 2 ...
 $ Departure.Arrival.time.convenient: int  4 1 0 0 3 3 5 2 2 2 ...
 $ Ease.of.Online.booking           : int  3 3 2 0 4 3 5 2 2 2 ...
 $ Gate.location                    : int  4 1 4 2 3 3 5 2 2 2 ...
 $ Food.and.drink                   : int  3 5 2 3 4 5 3 4 5 3 ...
 $ Online.boarding                  : int  4 4 2 4 1 5 5 4 5 4 ...
 $ Seat.comfort                     : int  3 5 2 4 2 3 5 5 5 4 ...
 $ Inflight.entertainment           : int  5 4 2 1 2 5 5 4 5 4 ...
 $ On.board.service                 : int  5 4 4 1 2 4 5 4 2 4 ...
 $ Leg.room.service                 : int  5 4 1 1 2 3 5 4 2 4 ...
 $ Baggage.handling                 : int  5 4 3 1 2 1 5 4 5 4 ...
 $ Checkin.service                  : int  2 3 2 3 4 1 4 5 3 5 ...
 $ Inflight.service                 : int  5 4 2 1 2 2 5 4 3 4 ...
 $ Cleanliness                      : int  5 5 2 4 4 5 3 3 5 4 ...
 $ Departure.Delay.in.Minutes       : int  50 0 0 0 0 0 0 77 1 28 ...
 $ Arrival.Delay.in.Minutes         : num  44 0 0 6 20 0 0 65 0 14 ...
 $ satisfaction                     : chr  "satisfied" "satisfied" "neutral or dissatisfied" "satisfied" ...
 $ satisfied                        : num  1 1 0 1 1 1 1 1 1 1 ...
summary(data)
       X               id            Gender          Customer.Type     
 Min.   :    0   Min.   :    17   Length:25976       Length:25976      
 1st Qu.: 6494   1st Qu.: 32170   Class :character   Class :character  
 Median :12988   Median : 65320   Mode  :character   Mode  :character  
 Mean   :12988   Mean   : 65006                                        
 3rd Qu.:19481   3rd Qu.: 97584                                        
 Max.   :25975   Max.   :129877                                        
                                                                       
      Age        Type.of.Travel        Class           Flight.Distance
 Min.   : 7.00   Length:25976       Length:25976       Min.   :  31   
 1st Qu.:27.00   Class :character   Class :character   1st Qu.: 414   
 Median :40.00   Mode  :character   Mode  :character   Median : 849   
 Mean   :39.62                                         Mean   :1194   
 3rd Qu.:51.00                                         3rd Qu.:1744   
 Max.   :85.00                                         Max.   :4983   
                                                                      
 Inflight.wifi.service Departure.Arrival.time.convenient Ease.of.Online.booking
 Min.   :0.000         Min.   :0.000                     Min.   :0.000         
 1st Qu.:2.000         1st Qu.:2.000                     1st Qu.:2.000         
 Median :3.000         Median :3.000                     Median :3.000         
 Mean   :2.725         Mean   :3.047                     Mean   :2.757         
 3rd Qu.:4.000         3rd Qu.:4.000                     3rd Qu.:4.000         
 Max.   :5.000         Max.   :5.000                     Max.   :5.000         
                                                                               
 Gate.location   Food.and.drink  Online.boarding  Seat.comfort  
 Min.   :1.000   Min.   :0.000   Min.   :0.000   Min.   :1.000  
 1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
 Median :3.000   Median :3.000   Median :4.000   Median :4.000  
 Mean   :2.977   Mean   :3.215   Mean   :3.262   Mean   :3.449  
 3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
                                                                
 Inflight.entertainment On.board.service Leg.room.service Baggage.handling
 Min.   :0.000          Min.   :0.000    Min.   :0.00     Min.   :1.000   
 1st Qu.:2.000          1st Qu.:2.000    1st Qu.:2.00     1st Qu.:3.000   
 Median :4.000          Median :4.000    Median :4.00     Median :4.000   
 Mean   :3.358          Mean   :3.386    Mean   :3.35     Mean   :3.633   
 3rd Qu.:4.000          3rd Qu.:4.000    3rd Qu.:4.00     3rd Qu.:5.000   
 Max.   :5.000          Max.   :5.000    Max.   :5.00     Max.   :5.000   
                                                                          
 Checkin.service Inflight.service  Cleanliness    Departure.Delay.in.Minutes
 Min.   :1.000   Min.   :0.000    Min.   :0.000   Min.   :   0.00           
 1st Qu.:3.000   1st Qu.:3.000    1st Qu.:2.000   1st Qu.:   0.00           
 Median :3.000   Median :4.000    Median :3.000   Median :   0.00           
 Mean   :3.314   Mean   :3.649    Mean   :3.286   Mean   :  14.31           
 3rd Qu.:4.000   3rd Qu.:5.000    3rd Qu.:4.000   3rd Qu.:  12.00           
 Max.   :5.000   Max.   :5.000    Max.   :5.000   Max.   :1128.00           
                                                                            
 Arrival.Delay.in.Minutes satisfaction         satisfied    
 Min.   :   0.00          Length:25976       Min.   :0.000  
 1st Qu.:   0.00          Class :character   1st Qu.:0.000  
 Median :   0.00          Mode  :character   Median :0.000  
 Mean   :  14.74                             Mean   :0.439  
 3rd Qu.:  13.00                             3rd Qu.:1.000  
 Max.   :1115.00                             Max.   :1.000  
 NA's   :83                                                 

Pre-Class Q9

Create a scatterplot to visualize the relationship between Flight.Distance and satisfied.

Click to show code and output
ggplot(data = data, aes(x = Flight.Distance, y = satisfied)) +
  geom_point() +
  geom_smooth(method = "glm", 
              method.args = "binomial") +
  coord_cartesian(ylim = c(-.1,1.1))

Pre-Class Q10

It looks like Flight.Distance might have some outliers. We can see if this is the case with a histogram.

Click to show code and output
hist(data$Flight.Distance)

Pre-Class Q11

To fix Flight.Distance, let’s log-transform it with log(). Assign this to the variable data$log_Flight.Distance.

Click to show code and output
data$log_Flight.Distance <- log(data$Flight.Distance)
hist(data$log_Flight.Distance)

If you stop here

You are ready for class. If you want extra practice and full points, try these questions.

Pre-Class Q12

You might have noticed that the logistic regression fit our airbnb.data nicely, but that the coefficients it gave us were… gross. In class, I will talk about how, in practice, a linear regression is often good enough, because our relationships are often close enough to linear. Truly strong S-shapes are rare in nature, because they suggest that our outcome suddenly flips/turns on at some level.

However, even when our data do follow an S-shape, we can often still manipulate our variables so that a linear regression makes sense. For extra practice and a challenge, I would like you to try manipulating the variable distance so that a linear regression will make sense and fit the data. Try doing this without looking at my answer first, and submit something of your own creation!

Option 1: Click to show code and output
I am eyeballing distance = 1 as a point where the likelihood of rats changes sharply. I'll use this point to split our data.
airbnb.data$distance_bin <- ifelse(airbnb.data$distance > 1, 1, 0)
linear_model_bin <- lm(data = airbnb.data, rats ~ distance_bin)
summary(linear_model_bin)

Call:
lm(formula = rats ~ distance_bin, data = airbnb.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.86667 -0.03529 -0.03529 -0.03529  0.96471 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.86667    0.05611   15.45   <2e-16 ***
distance_bin -0.83137    0.06086  -13.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2173 on 98 degrees of freedom
Multiple R-squared:  0.6557,    Adjusted R-squared:  0.6522 
F-statistic: 186.6 on 1 and 98 DF,  p-value: < 2.2e-16
ggplot(data = airbnb.data, 
       aes( x = distance_bin, 
            y = rats)) +
  geom_point() +
  geom_smooth(method = "lm") +
  coord_cartesian(ylim = c(-.1,1.1))

new_data <- data.frame(distance = c(0, 1, 6))
new_data$distance_bin <- ifelse(new_data$distance > 1, 1, 0)
predict(object = linear_model_bin,
        newdata = new_data,
        type = "response") |>
  round(3)
    1     2     3 
0.867 0.867 0.035 
Option 2: Click to show code and output
If we think there is a linear relationship in some subset of distance, but no relationship in another, we could create a variable that just cuts off distance when we think the relationship stops.
airbnb.data$distance_cut <- ifelse(airbnb.data$distance > 1, 1, 
                                   airbnb.data$distance) # Here, we are keeping distances inside of 1
linear_model_cut <- lm(data = airbnb.data, rats ~ distance_cut)
summary(linear_model_cut)

Call:
lm(formula = rats ~ distance_cut, data = airbnb.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.47527 -0.06613 -0.06613 -0.06613  0.93387 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.17150    0.09542   12.28   <2e-16 ***
distance_cut -1.10536    0.10069  -10.98   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.248 on 98 degrees of freedom
Multiple R-squared:  0.5515,    Adjusted R-squared:  0.5469 
F-statistic: 120.5 on 1 and 98 DF,  p-value: < 2.2e-16
ggplot(data = airbnb.data, 
       aes( x = distance_cut, 
            y = rats)) +
  geom_point() +
  geom_smooth(method = "lm") +
  coord_cartesian(ylim = c(-.1,1.1))

new_data <- data.frame(distance = c(0, 1, 6))
new_data$distance_cut <- ifelse(new_data$distance > 1, 1, 
                                   new_data$distance)
predict(object = linear_model_cut,
        newdata = new_data,
        type = "response") |>
  round(3)
    1     2     3 
1.171 0.066 0.066 

Pre-Class Q13

The whole point of the classes you have with me is learning how to make better predictions. That is especially true for logistic regression, where we are taking something we knew well (linear regression), and adding a lot of complexity.

For that complexity to be worth it, we have to be making better predictions. To see if we are, we can:

  1. Split our data into training and testing sets
  2. Train a regression on the training set
  3. Make predictions on the testing set
  4. See how close those are to our observations

In this question, I want you to split your airbnb.data into a training set of 80 rows, and a testing set of 20. Then, train a logistic regression on the training set, and see how well it predicts the test data.

Click to show code and output
set.seed(123)
#Shuffle airbnb.data
airbnb.data <- airbnb.data[sample(1:nrow(airbnb.data), size = nrow(airbnb.data)),]
train_df <- airbnb.data[1:80, ]
test_df <- airbnb.data[81:100, ]

logistic_model <- glm(data = train_df, rats ~ distance, family = "binomial")
summary(logistic_model)

Call:
glm(formula = rats ~ distance, family = "binomial", data = train_df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)   3.0339     1.1768   2.578  0.00993 **
distance     -2.5553     0.8314  -3.074  0.00211 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 71.007  on 79  degrees of freedom
Residual deviance: 26.341  on 78  degrees of freedom
AIC: 30.341

Number of Fisher Scoring iterations: 8
test_df$prediction <- predict(logistic_model,
                              newdata = test_df,
                              type = "response")

# MSE
mean((test_df$prediction - test_df$rats)^2)
[1] 0.03454432
# Make a "confusion matrix" by binarizing the predicted probabilities
test_df$predicted_rats <- ifelse(test_df$prediction >= .5, "Predict Rats", "Predict No Rats")
table(test_df$predicted_rats, test_df$rats)
                 
                   0  1
  Predict No Rats 16  0
  Predict Rats     1  3

Pre-Class Q14

Now, train a linear regression on the training set, and see how well it predicts the test data.

Click to show code and output
linear_model <- lm(data = train_df, rats ~ distance)
summary(linear_model)

Call:
lm(formula = rats ~ distance, data = train_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.45942 -0.22120 -0.03261  0.15540  0.73726 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.56308    0.06713   8.388 1.66e-12 ***
distance    -0.10996    0.01605  -6.853 1.50e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2952 on 78 degrees of freedom
Multiple R-squared:  0.3758,    Adjusted R-squared:  0.3678 
F-statistic: 46.96 on 1 and 78 DF,  p-value: 1.501e-09
test_df$prediction <- predict(linear_model,
                              newdata = test_df,
                              type = "response")

# MSE
mean((test_df$prediction - test_df$rats)^2)
[1] 0.06913223
# Make a "confusion matrix" by binarizing the predicted probabilities
test_df$predicted_rats <- ifelse(test_df$prediction >= .5, "Predict Rats", "Predict No Rats")
table(test_df$predicted_rats, test_df$rats)
                 
                   0  1
  Predict No Rats 17  0
  Predict Rats     0  3
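
If you want a single number to compare the two models beyond eyeballing the confusion matrices, one option is accuracy: the share of test reviews each model classifies correctly at the 0.5 cutoff. A quick sketch, re-predicting with both models (since test_df$prediction was overwritten by the linear model in Q14):

# Classify each test review with each model, using the same 0.5 cutoff as above
logit_class <- ifelse(predict(logistic_model, newdata = test_df, type = "response") >= .5, 1, 0)
linear_class <- ifelse(predict(linear_model, newdata = test_df, type = "response") >= .5, 1, 0)

mean(logit_class == test_df$rats)   # logistic regression accuracy
mean(linear_class == test_df$rats)  # linear regression accuracy

Reading off the confusion matrices above, that works out to 19/20 for the logistic model and 20/20 for the linear model on this particular split, even though the logistic model had the lower MSE.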

Footnotes

  1. In this example, our hypothesis is that being further from North Beach leads to a decrease in the mention of “rats” in reviews. The null hypothesis is that distance from North Beach has no effect on the mention of rats.↩︎

  2. Because our DV is either 0 or 1, the coefficient can be interpreted as a change in probability of the outcome with a one-unit change in the IV.↩︎